Adult¶
The dataset analyzed here is called ADULT and is available at https://archive.ics.uci.edu/dataset/2/adult.
It is widely used by data science students, mainly for predicting whether a person's income exceeds $50K/year from US census data, which is why it is also known as the "Census Income Dataset".
The data were collected in 1994, and the dataset has 48842 rows and 14 attribute columns, plus the income label.
Obtaining and Loading the Data¶
Since the data are available online, they will be downloaded directly into the notebook.
%%bash
## Check whether adult.zip exists and download it from UCI if needed:
[ -e adult.zip ] && echo "zip found" || wget https://archive.ics.uci.edu/static/public/2/adult.zip
## Check whether the archive has already been extracted and unzip it if needed:
[ -e adult ] && echo "zip already extracted" || unzip adult.zip -d adult
ls -lah adult
zip found
zip already extracted
total 5.8M
drwxr-xr-x 1 wesley wesley  102 Sep 24 00:46 .
drwxr-xr-x 1 wesley wesley  166 Nov 23 22:32 ..
-rwx------ 1 wesley wesley 3.8M May 22  2023 adult.data
-rwx------ 1 wesley wesley 5.2K May 22  2023 adult.names
-rwx------ 1 wesley wesley 2.0M May 22  2023 adult.test
-rwx------ 1 wesley wesley  140 May 22  2023 Index
-rwx------ 1 wesley wesley 4.2K May 22  2023 old.adult.names
## Let's take a closer look at the files:
!head -n 15 adult/*
==> adult/adult.data <==
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K
42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 40, United-States, >50K
37, Private, 280464, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 80, United-States, >50K
30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K
23, Private, 122272, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K
32, Private, 205019, Assoc-acdm, 12, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K
40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K
==> adult/adult.names <==
| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
| Data Mining and Visualization
| Silicon Graphics.
| e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database. A set of
==> adult/adult.test <==
|1x3 Cross validator
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
34, Private, 198693, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
29, ?, 227026, HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
63, Self-emp-not-inc, 104626, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
24, Private, 369667, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
55, Private, 104996, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
65, Private, 184454, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
36, Federal-gov, 212465, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.
26, Private, 82091, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 39, United-States, <=50K.
58, ?, 299831, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 35, United-States, <=50K.
==> adult/Index <==
Index of adult
02 Dec 1996 140 Index
10 Aug 1996 3974305 adult.data
10 Aug 1996 4267 adult.names
10 Aug 1996 2003153 adult.test
==> adult/old.adult.names <==
1. Title of Database: adult
2. Sources:
(a) Original owners of database (name/phone/snail address/email address)
US Census Bureau.
(b) Donor of database (name/phone/snail address/email address)
Ronny Kohavi and Barry Becker,
Data Mining and Visualization
Silicon Graphics.
e-mail: ronnyk@sgi.com
(c) Date received (databases may change over time without name change!)
05/19/96
3. Past Usage:
(a) Complete reference of article where it was described/used
@inproceedings{kohavi-nbtree,
author={Ron Kohavi},
An initial look at the files shows that the data of interest are in adult.data and adult.test.
The first line of the test file must be skipped, and its labels end with a period (<=50K. and >50K.). The column names are listed in adult.names:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- above50k: >50K, <=50K.
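As a minimal sketch of the point about adult.test, the snippet below reads a few lines in the same layout, skips the `|1x3 Cross validator` line, and removes the trailing period from the labels. The in-memory sample stands in for the real file, and `df_test` is a name introduced here for illustration only.

```python
import io
import pandas as pd

# Column names as used later in this notebook; the last one is the target.
features = ('age workclass fnlwgt education education_num marital_status '
            'occupation relationship race sex capital_gain capital_loss '
            'hours_per_week native_country above50k').split()

# Three lines in the same layout as adult/adult.test (note line + two data rows).
sample_test = io.StringIO(
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, "
    "Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, "
    "Husband, White, Male, 0, 0, 40, United-States, >50K.\n"
)

# skiprows=1 drops the "|1x3 Cross validator" line.
df_test = pd.read_csv(sample_test, names=features, skiprows=1)

# Labels in adult.test end with a period; strip it so they match adult.data.
df_test['above50k'] = df_test['above50k'].str.rstrip('.')
print(df_test['above50k'].tolist())
```

With the real file, `sample_test` would be replaced by the path `'adult/adult.test'`.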
We will use Pandas for the initial load.
The code below imports Pandas, defines a list with the column names, and reads the data file into a DataFrame. Special care is taken so that Pandas recognizes missing values, which are marked in the files with a question mark.
import pandas as pd
features = 'age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country above50k'.split()
df = pd.read_csv('adult/adult.data', names=features, na_values=['?',' ?',' ?'], keep_default_na=True)
df.sample(20)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | above50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11118 | 36 | Private | 58602 | 5th-6th | 3 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 35 | United-States | <=50K |
| 27689 | 49 | Private | 200198 | HS-grad | 9 | Married-civ-spouse | Other-service | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 14198 | 61 | Private | 273803 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 8563 | 28 | Private | 194690 | 7th-8th | 4 | Separated | Other-service | Own-child | White | Male | 0 | 0 | 60 | Mexico | <=50K |
| 775 | 23 | Local-gov | 282579 | Assoc-voc | 11 | Divorced | Tech-support | Not-in-family | White | Male | 0 | 0 | 56 | United-States | <=50K |
| 31725 | 30 | Private | 147596 | Some-college | 10 | Divorced | Adm-clerical | Unmarried | Black | Female | 0 | 0 | 40 | United-States | <=50K |
| 27644 | 56 | Local-gov | 273084 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 40 | United-States | >50K |
| 28774 | 51 | Private | 123053 | Masters | 14 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 5013 | 0 | 40 | India | <=50K |
| 13248 | 68 | Private | 168794 | Preschool | 1 | Never-married | Machine-op-inspct | Not-in-family | White | Male | 0 | 0 | 10 | United-States | <=50K |
| 453 | 42 | Private | 197583 | Assoc-acdm | 12 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 40 | NaN | >50K |
| 17451 | 61 | Private | 85194 | Some-college | 10 | Married-civ-spouse | Tech-support | Husband | White | Male | 0 | 0 | 25 | United-States | <=50K |
| 17223 | 20 | Private | 54012 | HS-grad | 9 | Never-married | Handlers-cleaners | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 30502 | 60 | Private | 198727 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 30 | United-States | <=50K |
| 28657 | 46 | Private | 465974 | 11th | 7 | Never-married | Transport-moving | Own-child | White | Male | 0 | 0 | 30 | United-States | <=50K |
| 20179 | 54 | Private | 97778 | 7th-8th | 4 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 11659 | 59 | Private | 147989 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | NaN | <=50K |
| 23774 | 59 | Private | 98361 | 11th | 7 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 662 | 27 | Private | 111900 | Some-college | 10 | Never-married | Prof-specialty | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 28496 | 31 | Self-emp-not-inc | 265807 | Assoc-voc | 11 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 3137 | 0 | 50 | United-States | <=50K |
| 18979 | 44 | Private | 86298 | HS-grad | 9 | Divorced | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
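Because every field in adult.data is followed by a comma and a space, the missing-value marker reaches Pandas as `" ?"` rather than `"?"`. The sketch below shows two ways of handling that on a single row copied from the file: listing the marker with its leading space, or using `skipinitialspace=True`. The column names `c0` to `c14` are placeholders for this demo.

```python
import io
import pandas as pd

# One row from adult.data whose native_country field is "?".
row = ("40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, "
       "Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K\n")
cols = [f'c{i}' for i in range(15)]  # placeholder column names

# Option 1: match the marker together with its leading space.
a = pd.read_csv(io.StringIO(row), names=cols, na_values=[' ?'])

# Option 2: drop the space after each delimiter, then match a bare "?".
b = pd.read_csv(io.StringIO(row), names=cols, na_values=['?'],
                skipinitialspace=True)

print(a['c13'].isna().iloc[0], repr(a['c1'].iloc[0]))  # strings keep the space
print(b['c13'].isna().iloc[0], repr(b['c1'].iloc[0]))  # strings are stripped
```

The two options are not interchangeable for the rest of this notebook: with option 2 the category values lose their leading space, so later filters such as `sex == " Male"` would have to be written without it.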
There are numeric and categorical columns. Each type needs to be analyzed differently.
Let's create a function to process each type of variable and show descriptive statistics. What we want to know about each type:
Categorical¶
- How many rows are there?
- How many rows are missing?
- Proportion of valid to NA rows (chart)
- How many categories are there?
- What are the categories?
- What is the mode?
- How many records are there per category? (bar chart)
Numeric¶
- How many rows are there?
- How many rows are missing?
- Proportion of valid to NA rows (chart)
- Minimum, maximum, mean, median, standard deviation
- Histogram
- Boxplot
import plotly.express as px
import plotly.graph_objects as go

def analisa_var_categorica(nome, dataframe):
    """Print and plot descriptive statistics for a categorical column."""
    v = dataframe[nome]
    na = v.isna().sum()
    print('\n\n')
    print('=' * 100, '\n')
    print(f'Variable {nome} has {len(v)} rows.')
    print(f'Of these, {na} rows contain null values (NA/NaN/etc).')
    # Donut chart: valid vs. null rows
    fig1 = px.pie(values=[len(v)-na, na], names=['Valid', 'Null'], hole=.6, width=800)
    fig1.show()
    # Note: unique() also counts NaN, so columns with nulls report one extra category
    print(f'In total, the series has {len(v.unique())} distinct categories. They are:')
    # Bar chart: number of records per category
    fig2 = px.bar(v.value_counts(), width=800)
    fig2.show()
    print(f'The mode of this variable is {v.mode()[0]}.')

def analisa_var_numerica(nome, dataframe):
    """Print and plot descriptive statistics for a numeric column."""
    v = dataframe[nome]
    na = v.isna().sum()
    print('\n\n')
    print('=' * 100, '\n')
    print(f'Variable {nome} has {len(v)} rows.')
    print(f'Of these, {na} rows contain null values (NA/NaN/etc).')
    # Donut chart: valid vs. null rows
    fig1 = px.pie(values=[len(v)-na, na], names=['Valid', 'Null'], hole=.6, width=800)
    fig1.show()
    print(f'Minimum: {v.min()}')
    print(f'Maximum: {v.max()}')
    print(f'Mean: {v.mean()}')
    print(f'Median: {v.median()}')
    print(f'Standard deviation: {v.std()}')
    fig2 = px.histogram(v, width=800)
    fig2.show()
    fig3 = px.box(v, width=800)
    fig3.show()

def analisa_dataframe(df):
    """Dispatch each column to the numeric or categorical analysis by dtype."""
    for c in df.columns:
        # dtype.kind is 'f' for floats and 'i' for integers
        if df[c].dtype.kind in 'fi':
            analisa_var_numerica(c, df)
        else:
            analisa_var_categorica(c, df)

analisa_dataframe(df)
====================================================================================================
Variable age has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 17
Maximum: 90
Mean: 38.58164675532078
Median: 37.0
Standard deviation: 13.640432553581341
====================================================================================================
Variable workclass has 32561 rows.
Of these, 1836 rows contain null values (NA/NaN/etc).
In total, the series has 9 distinct categories. They are:
The mode of this variable is Private.
====================================================================================================
Variable fnlwgt has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 12285
Maximum: 1484705
Mean: 189778.36651208502
Median: 178356.0
Standard deviation: 105549.97769702224
====================================================================================================
Variable education has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 16 distinct categories. They are:
The mode of this variable is HS-grad.
====================================================================================================
Variable education_num has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 1
Maximum: 16
Mean: 10.0806793403151
Median: 10.0
Standard deviation: 2.5727203320673877
====================================================================================================
Variable marital_status has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 7 distinct categories. They are:
The mode of this variable is Married-civ-spouse.
====================================================================================================
Variable occupation has 32561 rows.
Of these, 1843 rows contain null values (NA/NaN/etc).
In total, the series has 15 distinct categories. They are:
The mode of this variable is Prof-specialty.
====================================================================================================
Variable relationship has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 6 distinct categories. They are:
The mode of this variable is Husband.
====================================================================================================
Variable race has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 5 distinct categories. They are:
The mode of this variable is White.
====================================================================================================
Variable sex has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 2 distinct categories. They are:
The mode of this variable is Male.
====================================================================================================
Variable capital_gain has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 0
Maximum: 99999
Mean: 1077.6488437087312
Median: 0.0
Standard deviation: 7385.292084840338
====================================================================================================
Variable capital_loss has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 0
Maximum: 4356
Mean: 87.303829734959
Median: 0.0
Standard deviation: 402.9602186489998
====================================================================================================
Variable hours_per_week has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 1
Maximum: 99
Mean: 40.437455852092995
Median: 40.0
Standard deviation: 12.347428681731843
====================================================================================================
Variable native_country has 32561 rows.
Of these, 583 rows contain null values (NA/NaN/etc).
In total, the series has 42 distinct categories. They are:
The mode of this variable is United-States.
====================================================================================================
Variable above50k has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 2 distinct categories. They are:
The mode of this variable is <=50K.
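The category counts reported for workclass (9), occupation (15) and native_country (42) are each one higher than the number of values listed in adult.names (8, 14 and 41). The most likely reason is that `Series.unique()` includes NaN among the distinct values, whereas `Series.nunique()` ignores it by default. A small sketch on toy data (`v` and `df_small` are stand-ins, not the real columns):

```python
import pandas as pd

# Toy categorical column with one missing value.
v = pd.Series([' Private', ' State-gov', None, ' Private'])

print(len(v.unique()))           # NaN counted as a distinct value
print(v.nunique())               # NaN ignored (default dropna=True)
print(v.nunique(dropna=False))   # NaN counted again

# The same idea as a one-table overview of a whole DataFrame.
df_small = pd.DataFrame({
    'workclass': [' Private', ' State-gov', None, ' Private'],
    'age': [39, 50, 38, 53],
})
summary = pd.DataFrame({
    'missing': df_small.isna().sum(),
    'distinct': df_small.nunique(),
})
print(summary)
```

Applied to the notebook's `df`, the same `summary` construction would give the missing and distinct counts for all 15 columns in a single table.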
Prediction¶
Let's split the sample into training and test data and use a general-purpose algorithm that performs well on this kind of problem, such as Decision Trees or Random Forests. An advantage of these algorithms is that they report how important each feature is to the model's result.
df.sample(15)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | above50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18109 | 58 | Self-emp-not-inc | 193434 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 4613 | 77 | NaN | 232894 | 9th | 5 | Married-civ-spouse | NaN | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 26503 | 74 | NaN | 89667 | Bachelors | 13 | Widowed | NaN | Not-in-family | Other | Female | 0 | 0 | 35 | United-States | <=50K |
| 5385 | 61 | Private | 173924 | 9th | 5 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | Puerto-Rico | >50K |
| 16240 | 37 | Private | 183345 | Assoc-voc | 11 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 45 | United-States | <=50K |
| 12047 | 37 | Private | 171090 | 9th | 5 | Married-civ-spouse | Machine-op-inspct | Wife | Black | Female | 0 | 0 | 48 | United-States | <=50K |
| 1942 | 24 | Local-gov | 249101 | HS-grad | 9 | Divorced | Protective-serv | Unmarried | Black | Female | 0 | 0 | 40 | United-States | <=50K |
| 16188 | 58 | Private | 310320 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 21748 | 38 | Private | 229236 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | Other | Male | 0 | 0 | 40 | Puerto-Rico | <=50K |
| 19386 | 58 | Self-emp-not-inc | 130714 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 15405 | 49 | Local-gov | 199378 | HS-grad | 9 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 25410 | 45 | Local-gov | 53123 | 11th | 7 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 25 | United-States | <=50K |
| 22313 | 26 | Self-emp-not-inc | 258306 | 10th | 6 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 99 | United-States | <=50K |
| 30630 | 53 | Private | 133219 | HS-grad | 9 | Married-civ-spouse | Other-service | Husband | Black | Male | 4386 | 0 | 30 | United-States | >50K |
| 18717 | 26 | Private | 162302 | Some-college | 10 | Never-married | Machine-op-inspct | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 40 | Philippines | <=50K |
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(pd.get_dummies(df.drop('above50k', axis=1)), df['above50k'], test_size=0.25)
X_train.sample(15)
| | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1222 | 53 | 22154 | 10 | 0 | 0 | 40 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 14377 | 22 | 106700 | 12 | 0 | 0 | 27 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 8562 | 49 | 122066 | 9 | 0 | 0 | 30 | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 1270 | 47 | 200734 | 13 | 0 | 0 | 45 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 14360 | 24 | 26671 | 9 | 0 | 0 | 40 | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 26254 | 43 | 45156 | 10 | 0 | 0 | 60 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 12621 | 62 | 197286 | 8 | 0 | 0 | 48 | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 25379 | 45 | 127089 | 14 | 0 | 0 | 45 | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 21917 | 54 | 227832 | 8 | 0 | 0 | 40 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 728 | 31 | 223212 | 9 | 0 | 0 | 40 | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 32535 | 22 | 325033 | 8 | 0 | 0 | 35 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 11565 | 43 | 150533 | 12 | 0 | 0 | 50 | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| 14798 | 20 | 353195 | 9 | 0 | 0 | 35 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 21934 | 65 | 94552 | 10 | 0 | 0 | 40 | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 22447 | 43 | 195897 | 9 | 7298 | 0 | 40 | True | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
15 rows × 105 columns
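One detail of `pd.get_dummies` matters here because workclass, occupation and native_country contain NaN: by default a missing value produces no indicator column of its own, so the row is False in every dummy column of that variable. The sketch below shows this on a toy frame, along with the `dummy_na=True` option that adds an explicit indicator. `toy`, `plain` and `flagged` are illustrative names.

```python
import pandas as pd

# Toy column with one missing value, using the leading-space style of the data.
toy = pd.DataFrame({'native_country': [' Cuba', None, ' India']})

# Default: the NaN row gets no indicator set in any dummy column.
plain = pd.get_dummies(toy)

# dummy_na=True adds a native_country_nan column that flags the missing row.
flagged = pd.get_dummies(toy, dummy_na=True)

print(plain)
print(flagged)
```

Whether an explicit missing-value indicator helps the model is an empirical question; the notebook uses the default behavior.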
y_train.sample(15)
26638 <=50K
14471 >50K
14240 >50K
23534 <=50K
22836 <=50K
9876 <=50K
21230 <=50K
26957 >50K
29012 <=50K
19016 <=50K
12995 <=50K
10781 <=50K
10902 <=50K
24764 <=50K
6730 <=50K
Name: above50k, dtype: object
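The split above is drawn at random on every run, so the metrics that follow will vary slightly between executions. As a hedged sketch, `train_test_split` also accepts `random_state` for a repeatable split and `stratify` to keep the class proportions equal in both parts. The toy data below only mimics the roughly 76/24 class balance reported in adult.names, and the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 76 rows of the majority class and 24 of the minority class.
X = pd.DataFrame({'x': range(100)})
y = pd.Series([' <=50K'] * 76 + [' >50K'] * 24)

# random_state fixes the shuffle; stratify=y preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```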
model = RandomForestClassifier()
model.fit(X_train, y_train)
RandomForestClassifier()
y_hat = model.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
accuracy_score(y_test, y_hat)
0.8556688367522417
confusion_matrix(y_test, y_hat)
array([[5721, 469],
[ 706, 1245]])
print(classification_report(y_test, y_hat))
              precision    recall  f1-score   support

       <=50K       0.89      0.92      0.91      6190
        >50K       0.73      0.64      0.68      1951

    accuracy                           0.86      8141
   macro avg       0.81      0.78      0.79      8141
weighted avg       0.85      0.86      0.85      8141
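The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes, in sorted label order, so `<=50K` comes first) and uses only the four counts printed above.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class,
# class order (<=50K, >50K).
tn, fp = 5721, 469    # true <=50K: predicted <=50K / predicted >50K
fn, tp = 706, 1245    # true >50K:  predicted <=50K / predicted >50K

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of the rows predicted >50K, share that is right
recall_pos = tp / (tp + fn)      # of the rows truly >50K, share that is found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(total, round(accuracy, 4), round(precision_pos, 2),
      round(recall_pos, 2), round(f1_pos, 2))
```

These values agree with the `>50K` row of the report (0.73, 0.64, 0.68) and with the reported accuracy.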
# Pair each encoded column with its importance and keep the 15 largest
X_names = X_train.columns
fi = dict(zip(X_names, model.feature_importances_))
fi_sorted = sorted(fi.items(), key=lambda x: x[1], reverse=True)
top_fi = dict(fi_sorted[:15])
fig = px.bar(x=list(top_fi.keys()), y=list(top_fi.values()), title="Feature importance")
fig.show()
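Because the categorical columns were one-hot encoded, the chart ranks individual dummy columns (for example `marital_status_ Married-civ-spouse`) rather than the original variables. One possible way to read importance per original variable is to sum the dummies back to their source column. The sketch below does that. The importance values are invented for illustration and do not come from the fitted model, and `source_column` is a hypothetical helper.

```python
from collections import defaultdict

# A subset of the original column names, as defined earlier in the notebook.
originals = ['age', 'education', 'education_num', 'marital_status']

def source_column(encoded, originals):
    """Map an encoded column name back to the original column it came from."""
    if encoded in originals:          # numeric columns keep their name
        return encoded
    # Dummy columns are named "<original>_<value>"; prefer the longest match.
    matches = [c for c in originals if encoded.startswith(c + '_')]
    return max(matches, key=len)

# Made-up importances, only to demonstrate the aggregation.
fi_example = {
    'age': 0.15,
    'education_num': 0.06,
    'education_ Bachelors': 0.02,
    'education_ Masters': 0.01,
    'marital_status_ Married-civ-spouse': 0.07,
    'marital_status_ Never-married': 0.03,
}

grouped = defaultdict(float)
for name, value in fi_example.items():
    grouped[source_column(name, originals)] += value

print(dict(grouped))
```

The exact-match check comes first so that `education_num` is not mistaken for a dummy of `education`.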
The Professor's Question¶
Does the algorithm perform equally well for men and for women?
homens = df.query('sex == " Male"')
mulheres = df.query('sex == " Female"')
len(homens),len(mulheres)
(21790, 10771)
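Another way to read the question is to keep a single model trained on everyone and score its predictions separately for each group. The helper below is a minimal sketch of that idea in plain Python. `accuracy_by_group` is a hypothetical function, and the labels and predictions are made up just to show the call.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} computed from three parallel sequences."""
    hits, counts = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        counts[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / counts[g] for g in counts}

# Hypothetical labels, predictions and group membership.
y_true_demo = [' >50K', ' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K']
y_pred_demo = [' >50K', ' <=50K', ' >50K', ' >50K', ' <=50K', ' <=50K']
sex_demo = [' Male', ' Male', ' Male', ' Female', ' Female', ' Female']

result = accuracy_by_group(y_true_demo, y_pred_demo, sex_demo)
print(result)
```

With the objects from the first model, a call along the lines of `accuracy_by_group(y_test, y_hat, df.loc[y_test.index, 'sex'])` should work, provided it runs before `y_test` and `y_hat` are overwritten by the per-group cells.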
# Train and evaluate a separate model using only the rows for men
X_train, X_test, y_train, y_test = train_test_split(pd.get_dummies(homens.drop('above50k', axis=1)), homens['above50k'], test_size=0.25)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print('=============================')
print('ANALYSIS FOR THE GROUP OF MEN')
print('=============================')
print('Accuracy:', accuracy_score(y_test, y_hat))
print('Confusion matrix:\n', confusion_matrix(y_test, y_hat))
print('Classification report:\n', classification_report(y_test, y_hat))
=============================
ANALYSIS FOR THE GROUP OF MEN
=============================
Accuracy: 0.8166299559471366
Confusion matrix:
 [[3395  366]
 [ 633 1054]]
Classification report:
               precision    recall  f1-score   support

       <=50K       0.84      0.90      0.87      3761
        >50K       0.74      0.62      0.68      1687

    accuracy                           0.82      5448
   macro avg       0.79      0.76      0.78      5448
weighted avg       0.81      0.82      0.81      5448
# Train and evaluate a separate model using only the rows for women
X_train, X_test, y_train, y_test = train_test_split(pd.get_dummies(mulheres.drop('above50k', axis=1)), mulheres['above50k'], test_size=0.25)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print('===============================')
print('ANALYSIS FOR THE GROUP OF WOMEN')
print('===============================')
print('Accuracy:', accuracy_score(y_test, y_hat))
print('Confusion matrix:\n', confusion_matrix(y_test, y_hat))
print('Classification report:\n', classification_report(y_test, y_hat))
===============================
ANALYSIS FOR THE GROUP OF WOMEN
===============================
Accuracy: 0.9301893798737467
Confusion matrix:
 [[2339   42]
 [ 146  166]]
Classification report:
               precision    recall  f1-score   support

       <=50K       0.94      0.98      0.96      2381
        >50K       0.80      0.53      0.64       312

    accuracy                           0.93      2693
   macro avg       0.87      0.76      0.80      2693
weighted avg       0.92      0.93      0.92      2693
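Accuracy alone can be misleading when the two groups have different class balances, so it may help to set each accuracy against the score of always predicting `<=50K` and to look at recall for `>50K`. The sketch below derives both from the two confusion matrices printed above, assuming the usual row = true, column = predicted layout. `reported` and `summary` are illustrative names.

```python
# Confusion matrices reported above (rows = true class, columns = predicted,
# class order: <=50K, >50K).
reported = {
    'men': [[3395, 366], [633, 1054]],
    'women': [[2339, 42], [146, 166]],
}

summary = {}
for group, ((tn, fp), (fn, tp)) in reported.items():
    total = tn + fp + fn + tp
    summary[group] = {
        'accuracy': (tn + tp) / total,
        # accuracy obtained by always predicting the majority class (<=50K)
        'baseline': (tn + fp) / total,
        # share of true >50K rows that the model identifies
        'recall_gt50k': tp / (tp + fn),
    }
    print(group, {k: round(v, 3) for k, v in summary[group].items()})
```

On these numbers the majority-class baseline is about 0.69 for men and about 0.88 for women, so the model for women gains roughly 0.05 over its baseline against roughly 0.13 for men, and it finds a smaller share of the `>50K` cases (about 0.53 against 0.62). Each figure comes from a single random split, so the differences are indicative rather than conclusive.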